EC320, Lecture 2
06 2024
The focus of our course is regression analysis, part of the fundamental toolkit for learning from data.
Grasping the underlying theory is critical to understanding its mechanics and pitfalls.
Today: Review the essential concepts from Math 243
The following review packs a lot into a short time. You should have seen much of it before, but it may still feel overwhelming at first.
Data on a variable X are a sequence of n observations, indexed by i: \{x_i : i = 1, \dots, n \}.
Ex. n = 5
| i | x_i |
|---|---|
| 1 | 8 |
| 2 | 9 |
| 3 | 4 |
| 4 | 7 |
| 5 | 2 |
i indicates the row number.
n is the number of rows.
x_i is the value of X for row i.
The summation operator adds a sequence of numbers over an index:
\sum_{i=1}^{n} x_i \equiv x_1 + x_2 + \dots + x_n.
The sum of x_i from 1 to n.
| i | x_i |
|---|---|
| 1 | 7 |
| 2 | 4 |
| 3 | 10 |
| 4 | 3 |
\begin{aligned} \sum_{i=1}^{4} x_i &= 7 + 4 + 10 + 3 = 24 \\ {\scriptsize \text{sample average}} \Bigg \{ \ \frac{1}{n} \sum_{i=1}^n x_i \rightarrow \frac{1}{4} \sum_{i=1}^4 x_i &= \ 6 \end{aligned}
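These sums are easy to verify numerically. A quick sketch in Python using the values from the table above (the course's own code is R; this is just an illustration):

```python
x = [7, 4, 10, 3]  # the x_i values from the table
n = len(x)

# Sum over the index: x_1 + x_2 + ... + x_n
total = sum(x)     # 7 + 4 + 10 + 3 = 24

# Sample average: (1/n) * sum of x_i
mean = total / n   # 24 / 4 = 6.0

print(total, mean)
```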
For any constant c,
\sum_{i=1}^{n} c = nc.
| i | c |
|---|---|
| 1 | 2 |
| 2 | 2 |
| 3 | 2 |
| 4 | 2 |
\begin{aligned} \sum_{i=1}^{4} 2 &= 4 \times 2 \\ &= 8 \end{aligned}
For any constant c, \sum_{i=1}^{n} cx_i = c \sum_{i=1}^{n} x_i.
| i | x_i |
|---|---|
| 1 | 7 |
| 2 | 4 |
| 3 | 10 |
| 4 | 3 |
\begin{align*} \sum_{i=1}^{3} 2x_i &= 2\times7 + 2\times4 + 2 \times10\\ &= 14 + 8 + 20 = 42 \\ 2 \sum_{i=1}^{3} x_i &= 2(7 + 4 + 10) = 42 \end{align*}
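Both constant rules can be checked in a few lines; a Python sketch with the same numbers:

```python
x = [7, 4, 10]  # the first three x_i from the example
c = 2

# Rule: the sum of c * x_i equals c times the sum of x_i
lhs = sum(c * xi for xi in x)   # 14 + 8 + 20 = 42
rhs = c * sum(x)                # 2 * 21 = 42
assert lhs == rhs

# Rule: summing a constant c over n terms gives n * c
n = 4
assert sum(c for _ in range(n)) == n * c   # 4 * 2 = 8
```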
If \{(x_i, y_i) : i = 1, \dots, n \} is a set of n pairs, and a and b are constants, then
\sum_{i=1}^{n} (ax_i + by_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i
| i | a | x_i | b | y_i |
|---|---|---|---|---|
| 1 | 2 | 7 | 1 | 4 |
| 2 | 2 | 4 | 1 | 2 |
\begin{align*} \sum_{i=1}^{2} (2x_i + y_i) &= 18 + 10 = 28 \\ 2 \sum_{i=1}^{2} x_i + \sum_{i=1}^{2} y_i &= 2 \times 11 + 6 = 28 \end{align*}
The sum of the ratios is not the ratio of the sums: {\color{#81A1C1}{\sum_{i=1}^{n} x_i / y_i}} \neq \color{#B48EAD}{\left(\sum_{i=1}^{n} x_i \right) \Bigg/ \left(\sum_{i=1}^{n} y_i \right)}
Ex.
If n = 2, then \frac{x_1}{y_1} + \frac{x_2}{y_2} \neq \frac{x_1 + x_2}{y_1 + y_2}
The sum of squares is not the square of the sums: \color{#81A1C1}{\sum_{i=1}^{n} x_i^2} \neq \color{#B48EAD}{\left(\sum_{i=1}^{n} x_i \right)^2}
Ex.
If n = 2, then x_1^2 + x_2^2 \neq (x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2.
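A concrete check of both cautionary rules, sketched in Python with hypothetical values x = (7, 4) and y = (2, 5):

```python
x = [7, 4]
y = [2, 5]

# Sum of ratios vs. ratio of sums
sum_of_ratios = sum(xi / yi for xi, yi in zip(x, y))  # 7/2 + 4/5 = 4.3
ratio_of_sums = sum(x) / sum(y)                       # 11/7
assert sum_of_ratios != ratio_of_sums

# Sum of squares vs. square of the sum
sum_of_squares = sum(xi ** 2 for xi in x)             # 49 + 16 = 65
square_of_sum = sum(x) ** 2                           # 11^2 = 121
# The gap is exactly the cross term 2 * x_1 * x_2
assert square_of_sum - sum_of_squares == 2 * x[0] * x[1]
```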
Experiment:
Any procedure that is infinitely repeatable and has a well-defined set of outcomes.
Ex. Flip a coin 10 times and record the number of heads.
Random Variable:
A variable with numerical values determined by an experiment or a random phenomenon.
Sample Space:
The set of potential outcomes an experiment could generate
Ex. The sum of two dice is an integer from 2 to 12.
Event:
A subset of the sample space or a combination of outcomes.
Ex. Rolling a two or a four.
Notation: Capital letters for random variables (e.g., X, Y, or Z) and lowercase letters for particular outcomes (e.g., x, y, or z).
Experiment
Flipping a coin.
Events:
Heads or tails.
Random Variable: (X)
Receive $1 if heads, x_i=1, pay $1 if tails, x_i=-1
Sample Space:
\{-1,1\}
A random variable that takes a countable set of values.
Bernoulli (binary) random variable
Random variable that takes values of either 1 or 0.
More generally, if P(X=1) = \theta
for some \theta \in [0,1]
then
P(X=0) = 1 - \theta
We describe a discrete random variable by listing its possible values with associated probabilities.
If X takes on k possible values \{x_1, \dots, x_k\}, then the probabilities p_1, p_2, \dots, p_k are defined by p_j = P(X=x_j), \quad j = 1,2, \dots, k, where p_j \in [0,1] and p_1 + p_2 + \dots + p_k = 1.
Probability density function (pdf)
The pdf of X summarizes possible outcomes and associated probabilities:
f(x_j)=p_j, \quad j=1,2,\dots,k.
Ex. 2020 Presidential election: 538 electoral votes at stake.
Ex. A basketball player goes to the foul line to shoot two free throws. Let X be the number of shots made; suppose its pdf is f(0) = 0.3, f(1) = 0.4, f(2) = 0.3.
Use the pdf to calculate the probability of the event that the player makes at least one shot, i.e., P(X \geq 1).
P(X \geq 1) = P(X=1) + P(X=2)= 0.4 + 0.3 = 0.7
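The same calculation in Python, using a pdf consistent with the numbers above (f(0) = 0.3 is an assumption, backed out from P(X ≥ 1) = 0.7):

```python
# pdf of X = number of free throws made (values assumed from the example)
pdf = {0: 0.3, 1: 0.4, 2: 0.3}

# Sanity checks: probabilities lie in [0, 1] and sum to 1
assert all(0 <= p <= 1 for p in pdf.values())
assert abs(sum(pdf.values()) - 1) < 1e-12

# P(X >= 1) = P(X = 1) + P(X = 2)
p_at_least_one = sum(p for x, p in pdf.items() if x >= 1)
print(round(p_at_least_one, 3))  # 0.7
```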
A random variable that takes any particular real value with zero probability.
Wait, what?! The variable can take so many values that we can't count all the possibilities, so the probability of any one particular value is zero.
Measurement is discrete (e.g., dollars and cents), but variables with many possible values are best treated as continuous.
Probability density functions also describe continuous random variables.
Difference between continuous and discrete PDFs
Function that represents all outcomes of a random variable and the corresponding probabilities.
Key Takeaway: The shape of a distribution provides valuable information
The probability density function of a variable uniformly distributed between 0 and 2 is
f(x) = \begin{cases} \frac{1}{2} & \text{if } 0 \leq x \leq 2 \\ 0 & \text{otherwise} \end{cases}
By definition, the area under f(x) is equal to 1.
The shaded area illustrates the probability of the event 1 \leq X \leq 1.5.
P(1 \leq X \leq 1.5) = (1.5-1) \times0.5 = 0.25
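For a uniform density, the probability of an interval is just a rectangle's area, so the computation is one line. A Python sketch (assuming the interval [a, b] lies inside [0, 2]):

```python
def f(x):
    """pdf of X ~ Uniform(0, 2)."""
    return 0.5 if 0 <= x <= 2 else 0.0

def prob(a, b):
    """P(a <= X <= b): base times height of the rectangle under f,
    assuming 0 <= a <= b <= 2."""
    return (b - a) * 0.5

assert prob(1, 1.5) == 0.25   # the shaded region
assert prob(0, 2) == 1.0      # total area under the density is 1
```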
The “bell curve”
The shaded area illustrates the probability of the event -2 \leq X \leq 2.
Continuous distribution where x_i takes the value of any real number ({\mathbb{R}})
Rule 1: The probability that the random variable takes a value x_i is 0 for any x_i\in {\mathbb{R}}
Rule 2: The probability that the random variable falls in the interval [x_i, x_j], where x_i \neq x_j, is the area under p(x) between those two values
The shaded area represents P(-1.96 \leq X \leq 1.96) = 0.95. The values \{-1.96, 1.96\} are the quantiles that bound the central 95% of the standard normal distribution (the basis of 95% confidence intervals for \mu).
Quantitative measures used to describe the shape and characteristics of a probability distribution
Summarize and understand the important features of a distribution
First moment: Mean
Second moment: Variance
Third moment: Skewness
Fourth moment: Kurtosis
\quad \quad \quad \vdots
Describes the central tendency of a distribution in a single number.
Density functions describe the entire distribution, but sometimes we just want a summary.
Other summary statistics we may be interested in include
The expected value of a discrete random variable X is the weighted average of its k values \{x_1, \dots, x_k\} and their associated probabilities:
\begin{aligned} E(X) &= x_1 P(x_1) + x_2 P(x_2) + \dots +x_k P(x_k) \\ &= \sum_{j=1}^{k} x_jP(x_j). \end{aligned}
AKA: Population mean
Rolling a six-sided die once can take values \{1, 2, 3, 4, 5, 6\}, each with equal probability. What is the expected value of a roll?
\begin{align*} E(\text{Roll}) = 1 \times \frac{1}{6} &+ 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} \\ &+ 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = {3.5} \end{align*}
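The die-roll expectation, computed exactly with fractions; a Python sketch:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)            # each face is equally likely

# E(Roll) = sum of x_j * P(x_j)
EX = sum(x * p for x in faces)
assert EX == Fraction(7, 2)   # 3.5 -- not itself a possible outcome
```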
Note: The EV can be a number that isn’t a possible outcome of X.
If X is a continuous random variable and f(x) is its probability density function, then the expected value of X is
E(X) = \int_{-\infty}^{\infty} x f(x) dx.
For any constant c, E(c) = c.
For any constants a and b, E(aX + b) = aE(X) + b.
Ex. Suppose X is the high temperature in degrees Celsius in Eugene during August. The long-run average is E(X) = 28. If Y is the temperature in degrees Fahrenheit, then Y = 32 + \frac{9}{5} X. What is \color{#b48ead}{E(Y)}?
E(Y) = 32 + \frac{9}{5} E(X) = 32 + \frac{9}{5} \times 28 = \color{#b48ead}{82.4}
If \{a_1, a_2, \dots , a_n\} are constants and \{X_1, X_2, \dots , X_n\} are random variables, then
{\scriptsize \color{#4c566a}{E(a_1 X_1 + a_2 X_2 + \dots + a_n X_n)} = \color{#81A1C1}{a_1 E(X_1) + a_2 E(X_2) + \dots + a_n E(X_n)}}
In English, the expected value of the sum = the sum of expected values.
Ex. Suppose that a coffee shop sells X_1 small, X_2 medium, and X_3 large caffeinated beverages in a day. The quantities sold are random with expected values E(X_1) = 43, E(X_2) = 56, and E(X_3) = 21. The prices of small, medium, and large beverages are 1.75, 2.50, and 3.25 dollars. What is expected revenue?
\begin{align*} \color{#4c566a}{\scriptsize E(1.75 X_1 + 2.50 X_2 + 3.25 X_3)} &= \color{#81A1C1}{\scriptsize 1.75 E(X_1) + 2.50 E(X_2) + 3.25 E(X_3)} \\ &= \color{#b48ead}{\scriptsize 1.75(43) + 2.50(56) + 3.25(21)} \\ &= \color{#b48ead}{\scriptsize 283.5} \end{align*}
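The revenue example is a direct application of the linearity rule; a minimal Python check:

```python
prices = [1.75, 2.50, 3.25]   # small, medium, large
expected_q = [43, 56, 21]     # E(X1), E(X2), E(X3)

# Linearity: E(sum of a_j * X_j) = sum of a_j * E(X_j)
expected_revenue = sum(a * ex for a, ex in zip(prices, expected_q))
print(expected_revenue)  # 283.5
```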
Previously, we found that the expected value of rolling a six-sided die is E \left(\text{Roll} \right) = 3.5.
Is \left[E \left( \text{Roll} \right) \right]^2 the same as E \left(\text{Roll}^2 \right)?
\begin{align*} E \left( \text{Roll}^2 \right) &= 1^2 \times \frac{1}{6} + 2^2 \times \frac{1}{6} + 3^2 \times \frac{1}{6} + 4^2 \times \frac{1}{6} \\ &\quad \qquad \qquad + 5^2 \times \frac{1}{6} + 6^2 \times \frac{1}{6} \\ &\approx 15.167 \neq 12.25. \end{align*}
No!
Except in special cases, the transformation of an expected value is not the expected value of the transformed random variable.
For some function g(\cdot), it is typically the case that
\color{#4c566a}{g \left( E(X) \right)} \neq \color{#81A1C1}{E \left( g(X) \right)}.
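The die example makes the inequality concrete; exact fractions show that the gap between E(X²) and [E(X)]² is precisely the variance:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

EX = sum(x * p for x in faces)       # E(Roll) = 7/2
EX2 = sum(x**2 * p for x in faces)   # E(Roll^2) = 91/6

assert EX**2 == Fraction(49, 4)      # 12.25
assert EX2 == Fraction(91, 6)        # about 15.167
# g(E(X)) != E(g(X)) for g(x) = x^2; the gap is Var(X)
assert EX2 - EX**2 == Fraction(35, 12)
```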
Random variables \color{#b48ead}{X} and \color{#81A1C1}{Y} share the same population mean, but are distributed differently.
Tells us how far X deviates from \mu, on average:
\mathop{\text{Var}}(X) \equiv E\left[ (X - \mu)^2 \right] = \sigma^2_X
Where: \mu = E(X).
How tightly is a random variable distributed about its mean?
Describe the distance of X from its population mean \mu as the squared difference: (X - \mu)^2.
\mathop{\text{Var}}(X) = 0 \iff X is a constant.
Wait, what? How can a random variable be a constant? Because a constant fits the technical definition of a random variable. It's just not-so-random.
For any constants a and b, \mathop{\text{Var}}(aX + b) = a^2\mathop{\text{Var}}(X).
Ex. Suppose X is the high temperature in degrees Celsius in Eugene during August. If Y is the temperature in degrees Fahrenheit, then Y = 32 + \frac{9}{5} X. What is \color{#81A1C1}{\mathop{\text{Var}}(Y)}?
\mathop{\text{Var}}(Y) = \left(\frac{9}{5}\right)^2 \mathop{\text{Var}}(X) = \color{#81A1C1}{\frac{81}{25} \mathop{\text{Var}}(X)}
The positive square root of the variance:
\mathop{\text{sd}}(X) = +\sqrt{\mathop{\text{Var}}(X)} = \sigma
Rule 01: For any constant c, \mathop{\text{sd}}(c) = 0.
Rule 02: For any constants a and b, \mathop{\text{sd}}(aX + b) = \left| a \right|\mathop{\text{sd}}(X).
Note: Almost the same rules as variance, except that scaling enters as \left| a \right| rather than a^2.
When we’re working with a random variable X with an unfamiliar scale, it is useful to standardize it by defining a new variable Z:
Z \equiv \frac{X - \mu}{\sigma}
Z has mean 0 and standard deviation 1. How?
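The answer to the "How?" follows from the expectation and variance rules above:

\begin{aligned} E(Z) &= E\left( \frac{X - \mu}{\sigma} \right) = \frac{1}{\sigma} \left( E(X) - \mu \right) = \frac{\mu - \mu}{\sigma} = 0 \\ \mathop{\text{Var}}(Z) &= \mathop{\text{Var}}\left( \frac{1}{\sigma} X - \frac{\mu}{\sigma} \right) = \frac{1}{\sigma^2} \mathop{\text{Var}}(X) = \frac{\sigma^2}{\sigma^2} = 1 \end{aligned}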
For two random variables X and Y, the covariance is defined as the expected value (or mean) of the product of their deviations from their individual expected values:
\mathop{\text{Cov}}(X, Y) \equiv E \left[ (X - \mu_X) (Y - \mu_Y) \right] = \sigma_{XY}
Idea: Characterize the relationship between random variables X and Y.
Positive correlation: When \sigma_{XY} > 0, then X is above its mean when Y is above its mean, on average.
Negative correlation: When \sigma_{XY} < 0, then X is below its mean when Y is above its mean, on average.
Statistical independence:
If X and Y are independent, then E(XY) = E(X)E(Y).
Caution:
\mathop{\text{Cov}}(X, Y) = 0 does not imply that X and Y are independent.
\mathop{\text{Cov}}(X, Y) = 0 means that X and Y are uncorrelated.
For any constants a, b, c, and d,
\mathop{\text{Cov}}(aX + b, cY + d) = ac\mathop{\text{Cov}}(X, Y)
A problem with covariance is that it is sensitive to units of measurement.
The correlation coefficient solves this problem by rescaling the covariance:
\mathop{\text{Corr}}(X,Y) \equiv \frac{\mathop{\text{Cov}}(X,Y)}{\mathop{\text{sd}}(X) \times \mathop{\text{sd}}(Y)} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.
Also denoted as \rho_{XY}.
-1 \leq \mathop{\text{Corr}}(X,Y) \leq 1
Invariant to scale: if I double Y, \mathop{\text{Corr}}(X,Y) will not change.
Perfect positive correlation: \mathop{\text{Corr}}(X,Y) = 1.
Perfect negative correlation: \mathop{\text{Corr}}(X,Y) = -1.
Positive correlation: \mathop{\text{Corr}}(X,Y) > 0.
Negative correlation: \mathop{\text{Corr}}(X,Y) < 0.
No correlation: \mathop{\text{Corr}}(X,Y) = 0.
For constants a and b,
\mathop{\text{Var}} (aX + bY) = a^2 \mathop{\text{Var}}(X) + b^2 \mathop{\text{Var}}(Y) + 2ab\mathop{\text{Cov}}(X, Y).
If X and Y are uncorrelated, then \mathop{\text{Var}} (X + Y) = \mathop{\text{Var}}(X) + \mathop{\text{Var}}(Y)
If X and Y are uncorrelated, then \mathop{\text{Var}} (X - Y) = \mathop{\text{Var}}(X) + \mathop{\text{Var}}(Y)
Why do we estimate things? Because we can't measure everything.
Suppose we want to know the average height of the population in the US
How can we use these data to estimate the height of the population?
Estimand:
Quantity that is to be estimated in a statistical analysis
Estimator:
A rule (or formula) for estimating an unknown population parameter given a sample of data.
Estimate:
A specific numerical value that we obtain from the sample data by applying the estimator.
Suppose we want to know the average height of the population in the US
Estimand: The population mean (\mu)
Estimator: The sample mean (\bar{X})
\bar{X} = \dfrac{1}{n} \sum_{i=1}^n X_i
Estimate: The realized value of the sample mean, e.g., \hat{\mu} = 5\text{'}6\text{''}
Imagine that we want to estimate an unknown parameter \mu, and we know the distributions of three competing estimators. Which one should we use?
Question: What properties make an estimator reliable?
Answer (1): Unbiasedness
On average, does the estimator tend toward the correct value?
More formally: Does the mean of the estimator's distribution equal the parameter it estimates?
\mathop{\text{Bias}_\mu} \left( \hat{\mu} \right) = E\left[ \hat{\mu} \right] - \mu
Question: What properties make an estimator reliable?
A01: Unbiasedness
Unbiased estimator: E\left[ \hat{\mu} \right] = \mu
Biased estimator: E\left[ \hat{\mu} \right] \neq \mu
Is the sample mean \hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i an unbiased estimator of the population mean E(X_i) = \mu?
\begin{aligned} E\left[ \hat{\mu} \right] &= E\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \\ &=\frac{1}{n} \sum_{i=1}^n E\left[ X_i \right] \quad \big\} \quad \text{linearity of expectation} \\ &=\frac{1}{n} \sum_{i=1}^n \mu \quad \quad \ \ \ \big\} \quad \text{by definition} \\ &= \mu \end{aligned}
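The algebra can also be checked by simulation; a small Python sketch (population parameters chosen arbitrarily for illustration):

```python
import random

random.seed(320)

mu, sigma = 2.0, 20.0   # assumed population mean and standard deviation
n, reps = 10, 10_000    # sample size and number of replications

# Average the sample mean over many repeated samples
mean_of_means = sum(
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(reps)
) / reps

# Unbiasedness: the average of the estimates is close to mu
assert abs(mean_of_means - mu) < 0.5
```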
Question: What properties make an estimator reliable?
A02: Efficiency (low variance)
The central tendencies (means) of competing distributions are not the only things that matter. We also care about the variance of an estimator.
\mathop{\text{Var}} \left( \hat{\mu} \right) = E\left[ \left( \hat{\mu} - E\left[ \hat{\mu} \right] \right)^2 \right]
Lower-variance estimators produce estimates that cluster more tightly around their mean in each sample.
Question: What properties make an estimator reliable?
A02: Efficiency (low variance)
Should we be willing to accept a bit of bias to reduce the variance?
In economics/causal inference we emphasize unbiasedness
In addition to the sample mean, there are several other commonly used estimators.
Sample variance estimates variance \sigma^2.
Sample covariance estimates covariance \sigma_{XY}.
Sample correlation estimates the pop. correlation coefficient \rho_{XY}.
Sample variance, S_X^2, is an unbiased estimator of the pop. variance \sigma^2
S_{X}^2 = \dfrac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2.
Sample covariance, S_{XY}, is an unbiased estimator of the pop. covariance, \sigma_{XY}
S_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}).
Sample correlation r_{XY} is a consistent (though not exactly unbiased) estimator of the pop. correlation coefficient \rho_{XY}
r_{XY} = \dfrac{S_{XY}}{\sqrt{S_X^2} \sqrt{S_Y^2}}.
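These three estimators are short enough to implement from scratch; a Python sketch with made-up data:

```python
def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    """S_X^2: note the n - 1 in the denominator."""
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

def sample_cov(x, y):
    """S_XY, also with the n - 1 correction."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_corr(x, y):
    """r_XY = S_XY / (S_X * S_Y)."""
    return sample_cov(x, y) / (sample_var(x) ** 0.5 * sample_var(y) ** 0.5)

x = [8, 9, 4, 7, 2]
y = [1, 3, 2, 5, 4]   # hypothetical paired values

r = sample_corr(x, y)
assert -1 <= r <= 1
assert abs(sample_corr(x, x) - 1) < 1e-12   # perfectly correlated with itself
```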
Population:
A group of items or events we would like to know about.
Ex. Americans, games of chess, cats in Eugene, etc.
Parameter:
A value that describes the population.
Ex. Mean height of Americans, average length of a chess game, median weight of the kitties
Sample:
A survey of a subset of the population.
Ex. Respondents to a survey, random sample of econ students at the UO
Often we aim to draw observations randomly from the population
Focus: Populations vs Samples
Challenge: We usually cannot observe the entire population.
Solution: Sample from the population and estimate the parameter.
There are myriad ways to produce a sample, but we will restrict our attention to simple random sampling, where
Each observation is a random variable.
The n random variables are independent.
Life becomes much simpler for the econometrician.
Question: Why do we care about population vs. sample?
Let’s repeat this 10,000 times and then plot the estimates.
(This exercise is called a Monte Carlo simulation.)
# Load packages
library(tidyverse)

# Set the seed
set.seed(12468)

# Set population and sample sizes
n_p <- 100
n_s <- 10

# Generate data
pop_df <- tibble(
  x = rnorm(n_p, mean = 2, sd = 20)
)

# Population mean and plot colors (hex values assumed from the slide theme)
m0 <- mean(pop_df$x)
hi <- "#b48ead"
hii <- "#81a1c1"

# Simulation: draw 10,000 samples of size n_s and record each sample mean
sim_df <- parallel::mclapply(mc.cores = 4, X = 1:1e4, FUN = function(x, size = n_s) {
  pop_df %>%
    sample_n(size = size) %>%
    summarize(mu_hat = mean(x))
}) %>% do.call(rbind, .) %>% as_tibble()

# Create histogram of simulation
ggplot(data = sim_df, aes(mu_hat)) +
  geom_histogram(binwidth = 1, fill = hii, color = "white", size = 0.25, alpha = 0.6) +
  geom_vline(xintercept = m0, size = 2, color = hi) +
  scale_x_continuous(breaks = m0) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, NA)) +
  theme(axis.text.x = element_text(size = 20),
        axis.text.y = element_blank(),
        rect = element_blank(),
        axis.title.y = element_blank(),
        axis.title.x = element_text(size = 20, hjust = 1, color = hi),
        line = element_blank())

Question: Why do we care about population vs. sample?
On average, the means of the samples are close to the population mean.
Question: Why do we care about population vs. sample?
Answer: Uncertainty matters.
Consider the following argument:
Suppose we have some estimator \hat{\theta} for a parameter \theta:
We can say:
"If \theta really was 2.5, then the probability of getting \hat{\theta} = 45 is very low. Thus the probability that \theta is actually 2.5 is very low."
But what distribution should we be assuming?
Theorem
Let x_1, x_2, \dots, x_n be a random sample from a population with mean E\left[ X \right] = \mu and variance \text{Var}\left( X \right) = \sigma^2 < \infty, and let \bar{X} be the sample mean. Then, as n \rightarrow \infty, \frac{\sqrt{n}\left(\bar{X}-\mu\right)}{S_X} converges in distribution to a standard normal (mean 0, variance 1).
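The theorem can be illustrated by simulation: draw from a decidedly non-normal population and watch the standardized sample mean behave like a standard normal. A Python sketch:

```python
import random

random.seed(320)

n, reps = 50, 10_000   # sample size and number of replications

def standardized_mean():
    # Uniform(0, 1) population: mu = 0.5, clearly not normal
    draws = [random.random() for _ in range(n)]
    xbar = sum(draws) / n
    s = (sum((d - xbar) ** 2 for d in draws) / (n - 1)) ** 0.5
    return n ** 0.5 * (xbar - 0.5) / s

z = [standardized_mean() for _ in range(reps)]

# If the CLT holds, roughly 95% of the z's land in [-1.96, 1.96]
share = sum(-1.96 <= zi <= 1.96 for zi in z) / reps
assert abs(share - 0.95) < 0.02
```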
There are two broad types of data:
Experimental data: Generated in controlled, laboratory settings. Ideal for causal identification, but difficult to obtain.
Observational data: Generated in non-experimental settings. Commonly used, though it poses challenges to causal identification.
Types of observational data:
Sample of individuals from a population at a point in time
Ideally collected using random sampling
Note: Used extensively in applied microeconomics and is the main focus of this course
Observations of variables over time
Complication: Observations are not independent draws
More advanced methods are needed
Cross sections from different points in time
Useful for studying relationships that change over time.
Again, requires more advanced methods
Time series for each cross sectional unit
Ex. Daily attendance across my class
Can control for unobserved characteristics
Again, requires more advanced methods
Analysis-ready datasets are rare. Most data are messy.
Data wrangling is a non-trivial part of an economist's or data scientist's job.
R has a suite of packages that facilitate data wrangling:
tidyverse: readr, tidyr, dplyr, ggplot2 + others
The variance of a random variable X is defined as:
\text{Var}(X) = E[(X - \mu_X)^2]
\text{Cov}(X, Y) is defined as:
\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]
For two random variables X and Y, the variance of their sum X + Y is:
\text{Var}(X + Y) = E[((X + Y) - (\mu_X + \mu_Y))^2]
Expanding the squared term, we get:
\begin{align*} \text{Var}(X + Y) &= E[(X - \mu_X + Y - \mu_Y)^2] \\ &= E[(X - \mu_X)^2 + 2(X - \mu_X)(Y - \mu_Y) + (Y - \mu_Y)^2] \\ &= E[(X - \mu_X)^2] + E[2(X - \mu_X)(Y - \mu_Y)] + E[(Y - \mu_Y)^2] \\ &= \text{Var}(X) + 2\text{Cov}(X, Y) + \text{Var}(Y) \end{align*}
If X and Y are uncorrelated, then \text{Cov}(X, Y) = 0, and the above simplifies to:
\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)
Similarly, the variance of the difference X - Y is:
\text{Var}(X - Y) = E[((X - Y) - (\mu_X - \mu_Y))^2]
Expanding the squared term, just like before:
\begin{align*} \text{Var}(X - Y) &= E[(X - \mu_X - (Y - \mu_Y))^2] \\ &= E[(X - \mu_X)^2 - 2(X - \mu_X)(Y - \mu_Y) + (Y - \mu_Y)^2] \\ &= \text{Var}(X) - 2\text{Cov}(X, Y) + \text{Var}(Y) \end{align*}
Again, if X and Y are uncorrelated, \text{Cov}(X, Y) = 0, and we have:
\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y)
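The identity derived above holds exactly for sample moments as well (it is pure algebra), which a quick Python simulation confirms:

```python
import random

random.seed(42)

# Two correlated random samples
N = 10_000
x = [random.gauss(0, 1) for _ in range(N)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

s = [xi + yi for xi, yi in zip(x, y)]

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y), term for term
lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)
assert abs(lhs - rhs) < 1e-8
```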
EC320, Lecture 2 | Statistics Review